Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions
Authors
Abstract
Consider a given value function on the states of a Markov decision problem, as might result from applying a reinforcement learning algorithm. Unless this value function equals the corresponding optimal value function, at some states there will be a discrepancy between what the value function specifies at that state and what is obtained by a one-step lookahead along the seemingly best action at that state, using the given value function to evaluate all succeeding states; it is natural to call this discrepancy the Bellman residual. This paper derives a tight bound on how far from optimal the discounted return for a greedy policy based on the given value function can be, as a function of the maximum-norm magnitude of this Bellman residual. A corresponding result is also obtained for value functions defined on state-action pairs, as are used in Q-learning. One significant application of these results is to problems where a function approximator is used to learn a value function, with training of the approximator based on trying to minimize the Bellman residual across states or state-action pairs. When control is based on the use of the resulting value function, this result provides a link between how well the objectives of function approximator training are met and the quality of the resulting control.
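As a concrete illustration of the kind of result described above, the sketch below builds a small random MDP, perturbs its optimal value function, measures the max-norm Bellman residual of the perturbed values, and compares the greedy policy's actual loss against the quantity 2γε/(1−γ). The 2γε/(1−γ) form is how this paper's bound is commonly quoted, but it is taken here as an assumption (the abstract itself does not state the constant), and the MDP and all names in the code are arbitrary test data, not anything from the paper.

import numpy as np

# A minimal sketch, assuming the bound has the form 2*gamma*eps/(1 - gamma),
# where eps is the max-norm Bellman residual of the imperfect value function.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random transition kernel P[a, s, s'] and reward table R[a, s] (toy data).
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_actions, n_states))

def bellman_backup(V):
    # Q[a, s] = R[a, s] + gamma * sum_{s'} P[a, s, s'] * V[s']
    return R + gamma * P @ V

# Value iteration to (numerically) obtain the optimal value function V*.
V = np.zeros(n_states)
for _ in range(2000):
    V = bellman_backup(V).max(axis=0)
V_star = V

# An imperfect value function: V* plus bounded noise.
V_hat = V_star + rng.uniform(-0.05, 0.05, n_states)

# Max-norm Bellman residual of V_hat.
eps = np.max(np.abs(bellman_backup(V_hat).max(axis=0) - V_hat))

# Greedy policy w.r.t. V_hat, evaluated exactly via
# V_pi = (I - gamma * P_pi)^{-1} R_pi.
pi = bellman_backup(V_hat).argmax(axis=0)
P_pi = P[pi, np.arange(n_states), :]
R_pi = R[pi, np.arange(n_states)]
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

loss = np.max(V_star - V_pi)
print(f"Bellman residual eps = {eps:.4f}")
print(f"greedy-policy loss   = {loss:.4f}")
print(f"assumed bound        = {2 * gamma * eps / (1 - gamma):.4f}")

On runs like this the measured loss should fall below the printed bound; the paper's contribution is that a bound of this type is tight, i.e., there exist MDPs and value functions for which it is achieved.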
Similar Papers
Information Relaxation Bounds for Infinite Horizon Markov Decision Processes
We consider the information relaxation approach for calculating performance bounds for stochastic dynamic programs (DPs), following Brown, Smith, and Sun (2010). This approach generates performance bounds by solving problems with relaxed nonanticipativity constraints and a penalty that punishes violations of these constraints. In this paper, we study infinite horizon DPs with discounted costs a...
On Integral Operator and Argument Estimation of a Novel Subclass of Harmonic Univalent Functions
Abstract. In this paper we define and verify a subclass of harmonic univalent functions involving the argument of complex-valued functions of the form f = h + ḡ, and investigate some properties of this subclass, e.g., necessary and sufficient coefficient bounds, extreme points, distortion bounds, and the Hadamard product.
Information Relaxations, Duality, and Convex Stochastic Dynamic Programs
We consider the information relaxation approach for calculating performance bounds for stochastic dynamic programs (DPs). This approach generates performance bounds by solving problems with relaxed nonanticipativity constraints and a penalty that punishes violations of these nonanticipativity constraints. In this paper, we study DPs that have a convex structure and consider gradient penalties t...
Nested performance bounds and approximate solutions for the sensor placement problem
This paper considers the placement of m sensors at n > m possible locations. Given noisy observations, knowledge of the state correlation matrix, and a mean-square error criterion (equivalently maximizing an efficacy cost criterion), the problem is formulated as an integer programming problem. Computing the solution for large m and n is infeasible, requiring us to look at approximate algorithms...
Learning Weighted Rule Sets for Forward Search Planning
In many planning domains, it is possible to define and learn good rules for reactively selecting actions. This has led to work on learning rule-based policies as a form of planning control knowledge. However, it is often the case that such learned policies are imperfect, leading to planning failure when they are used for greedy action selection. In this work, we seek to develop a more robust f...